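{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Task description\n", "\n", "- Topic: code statistics for papers, i.e. statistics on the papers that provide code;\n", "- Content: use regular expressions to extract code links, page counts, and figure counts;\n", "- Outcome: learn to compute statistics with regular expressions;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data processing steps\n", "\n", "In the raw arXiv dataset, authors often give a link to their code in a paper's `comments` or `abstract` field, so we need to find code links in these fields.\n", "\n", "- Determine where the data appears;\n", "- Match it with regular expressions;\n", "- Compute the relevant statistics;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular expressions\n", "\n", "A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract substrings that satisfy some condition.\n", "\n", "#### Ordinary characters: uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols\n", "\n", "| Character | Description |\n", "| ---------- | ------------------------------------------------------------ |\n", "| **[ABC]** | Matches any single character listed in [...]. For example, [aeiou] matches every e, o, u, and a in the string \"google runoob taobao\". |\n", "| **[^ABC]** | Matches any single character not listed in **[...]**. For example, **[^aeiou]** matches every character in the string \"google runoob taobao\" except e, o, u, and a. |\n", "| **[A-Z]** | [A-Z] denotes a range and matches any uppercase letter; [a-z] matches any lowercase letter. |\n", "| . | Matches any single character except the newline \\n. In Python's `re` module this is equivalent to **[^\\n]** unless the `re.DOTALL` flag is set. |\n", "| **[\\s\\S]** | Matches any character. \\s matches any whitespace character, including line breaks; \\S matches any non-whitespace character. |\n", "| **\\w** | Matches letters, digits, and the underscore. Equivalent to [A-Za-z0-9_] for ASCII text; on Python 3 `str` patterns it also matches Unicode letters and digits. |\n", "\n", "#### Special characters: characters with a special meaning\n", "\n", "| Special character | Description |\n", "| :------- | :----------------------------------------------------------- |\n", "| ( ) | Mark the start and end of a subexpression. The subexpression is captured for later use. To match these characters literally, use \\( and \\). |\n", "| * | Matches the preceding subexpression zero or more times. To match a literal *, use \\*. |\n", "| + | Matches the preceding subexpression one or more times. To match a literal +, use \\+. |\n", "| . | Matches any single character except the newline \\n. To match a literal ., use \\. |\n", "| [ | Marks the start of a bracket expression. To match a literal [, use \\[. |\n", "| ? | Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match a literal ?, use \\?. |\n", "| \\ | Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\\n' matches a newline. The sequence '\\\\' matches \"\\\", and '\\(' matches \"(\". |\n", "| ^ | Matches the start of the input string. When used as the first character inside a bracket expression, it instead negates the character set. To match a literal ^, use \\^. |\n", "| { | Marks the start of a quantifier expression. To match a literal {, use \\{. |\n", "| \\| | Indicates a choice between two alternatives. To match a literal \\|, escape it with a backslash. |\n", "\n", "#### Quantifiers\n", "\n", "| Character | Description |\n", "| :---- | :----------------------------------------------------------- |\n", "| * | Matches the preceding subexpression zero or more times. For example, zo* matches both \"z\" and \"zoo\". * is equivalent to {0,}. |\n", "| + | Matches the preceding subexpression one or more times. For example, 'zo+' matches \"zo\" and \"zoo\" but not \"z\". + is equivalent to {1,}. |\n", "| ? | Matches the preceding subexpression zero or one time. For example, \"do(es)?\" matches \"do\", the \"does\" in \"does\", and the \"do\" in \"doxy\". ? is equivalent to {0,1}. |\n", "| {n} | n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in \"Bob\", but matches the two o's in \"food\". |\n", "| {n,} | n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in \"Bob\", but matches all the o's in \"foooood\". 'o{1,}' is equivalent to 'o+', and 'o{0,}' is equivalent to 'o*'. |\n", "| {n,m} | m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, \"o{1,3}\" matches the first three o's in \"fooooood\". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers. |\n", "\n", "## Implementation and walkthrough\n", "\n", "First we count pages per paper, that is, we extract the numbers of pages and figures from the `comments` field. We start by reading in the fields.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2021-01-02T07:29:48.711453Z", "start_time": "2021-01-02T07:29:48.059043Z" } }, "outputs": [], "source": [ "# Import the required packages\n", "import seaborn as sns  # plotting\n", "from bs4 import BeautifulSoup  # HTML parsing, for scraping arXiv data\n", "import re  # regular expressions, for matching string patterns\n", "import requests  # sending HTTP requests\n", "import json  # reading the data, which is stored as JSON\n", "import pandas as pd  # data processing and analysis\n", "import matplotlib.pyplot as plt  # plotting" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2021-01-02T07:30:10.507358Z", "start_time": "2021-01-02T07:29:49.676050Z" } }, "outputs": [], "source": [ "def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',\n", " 'report-no', 'categories', 'license', 'abstract', 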
'versions',\n", " 'update_date', 'authors_parsed'], count=None):\n", "    '''\n", "    Read the arXiv metadata file (one JSON record per line) into a DataFrame.\n", "    path: file path\n", "    columns: columns to keep\n", "    count: number of lines to read (None reads the whole file)\n", "    '''\n", "\n", "    data = []\n", "    with open(path, 'r') as f:\n", "        for idx, line in enumerate(f):\n", "            if idx == count:\n", "                break\n", "\n", "            d = json.loads(line)\n", "            d = {col: d[col] for col in columns}\n", "            data.append(d)\n", "\n", "    data = pd.DataFrame(data)\n", "    return data\n", "\n", "data = readArxivFile('arxiv-metadata-oai-snapshot.json', ['id', 'abstract', 'categories', 'comments'])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract the page count:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2021-01-02T07:30:15.199110Z", "start_time": "2021-01-02T07:30:10.718931Z" } }, "outputs": [], "source": [ "# Match 'XX pages' with a regular expression\n", "data['pages'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* pages', str(x)))\n", "\n", "# Keep only papers that state a page count\n", "data = data[data['pages'].apply(len) > 0]\n", "\n", "# Each match is a list such as ['19 pages'], so convert its first element to a number\n", "data['pages'] = data['pages'].apply(lambda x: float(x[0].replace(' pages', '')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Summary statistics for pages are shown below: the mean length is 17 pages, 75% of papers have at most 22 pages, and the longest paper is listed with 11232 pages." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2021-01-02T07:30:27.468809Z", "start_time": "2021-01-02T07:30:27.383009Z" } }, "outputs": [ { "data": { "text/plain": [ "count 1089180\n", "mean 17\n", "std 22\n", "min 1\n", "25% 8\n", "50% 13\n", "75% 22\n", "max 11232\n", "Name: pages, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['pages'].describe().astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, compute page counts by category, using the top-level part of each paper's first listed category:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2021-01-02T07:30:59.170351Z", "start_time": 
"2021-01-02T07:30:58.096126Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Keep the top-level part of the first category (e.g. 'cs.CV cs.LG' -> 'cs')\n", "data['categories'] = data['categories'].apply(lambda x: x.split(' ')[0])\n", "data['categories'] = data['categories'].apply(lambda x: x.split('.')[0])\n", "\n", "# Mean page count per category\n", "plt.figure(figsize=(12, 6))\n", "data.groupby(['categories'])['pages'].mean().plot(kind='bar')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, extract the number of figures per paper:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2021-01-02T07:31:16.225134Z", "start_time": "2021-01-02T07:31:12.823092Z" } }, "outputs": [], "source": [ "data['figures'] = data['comments'].apply(lambda x: re.findall('[1-9][0-9]* figures', str(x)))\n", "data = data[data['figures'].apply(len) > 0]\n", "data['figures'] = data['figures'].apply(lambda x: float(x[0].replace(' figures', '')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we extract the papers' code links. To keep the task simple, we only extract GitHub links:\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2021-01-02T07:32:16.121702Z", "start_time": "2021-01-02T07:32:15.033667Z" } }, "outputs": [], "source": [ "# Keep papers that mention github in comments or abstract;\n", "# .copy() avoids SettingWithCopyWarning when columns are added below\n", "data_with_code = data[\n", "    (data.comments.str.contains('github')==True)|\n", "    (data.abstract.str.contains('github')==True)\n", "].copy()\n", "# Join with a space so a link ending the abstract does not run into the comments\n", "data_with_code['text'] = data_with_code['abstract'].fillna('') + ' ' + data_with_code['comments'].fillna('')\n", "\n", "# Match GitHub links with a regular expression and count them per paper\n", "pattern = r'[a-zA-Z]+://github[^\\s]*'\n", "data_with_code['code_flag'] = data_with_code['text'].str.findall(pattern).apply(len)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then plot the number of papers with code by category:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2021-01-02T07:32:29.528795Z", "start_time": "2021-01-02T07:32:29.374662Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# code_flag holds the number of GitHub links found; keep papers with at least one\n", "data_with_code = data_with_code[data_with_code['code_flag'] >= 1]\n", "plt.figure(figsize=(12, 6))\n", "data_with_code.groupby(['categories'])['code_flag'].count().plot(kind='bar')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }